2025-01-14
Our website: https://thomaselove.github.io/432-2025/
Visit the Calendar at the top of the page, which will take you to the Class 01 README page.
Just about everything is linked at https://thomaselove.github.io/432-2025
Every deliverable is listed in the Calendar.
Assignments include two projects, seven labs, and two quizzes. Almost everything is due on Wednesdays at noon.
Project A (publicly available data: linear & logistic models)
Project B (use almost any data and build specific models)
Seven labs, meant to be (generally) shorter than 431 Labs
Lab 6 is about building or augmenting your website, and can be done now (or at any time), although it’s not due until 2025-03-26.
The Syllabus and the Lab Instructions provide details on how feedback is given.
We WELCOME questions/comments/corrections/thoughts!
Some source materials are password-protected. What is the password?
I broke my left leg and did a fair amount of soft tissue damage on 2024-11-19, which resulted in ankle surgery on 2024-12-11. So this semester will be weird: I am not yet allowed to put weight on that foot, and I am in a boot and using a walker.
Zoom link for each class is found in your email, on Canvas, and in our Shared Google Drive.
“A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible….
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.”
Read Sections 3 (Data transformation) and 5 (Data tidying)
We want:
clean_names() from the janitor package to turn everything into snake_case. Jenny Bryan's advice on "Naming Things" holds up well. There's a full presentation at SpeakerDeck.
Good file names:
Avoid: spaces, punctuation, accented characters, case sensitivity
Deliberately use delimiters to make things easy to compute on and make it easy to recover meta-data from the filenames.
Don’t spend a lot of time bemoaning or cleaning up past ills. Strive to improve this sort of thing going forward.
https://quarto.org/ is the main website for Quarto.
If you can write an R Markdown file, it will also work in Quarto: just switch the extension from .Rmd to .qmd.
All material for this course is written using Quarto.
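For illustration, a minimal .qmd file (title and contents invented here) looks just like an R Markdown file, with the same YAML header and R code chunks:

````markdown
---
title: "A Tiny Quarto Example"
format: html
---

## Setup

```{r}
library(tidyverse)
```
````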
Splitting our_tibble into training/test samples: we will place 60% of the penguins in our training sample, and require that similar fractions of each species occur in our training and testing samples. We use functions from the rsample package here.
We could have used slice_sample() as in the Course Notes, too.
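A sketch of that split with rsample, assuming our_tibble holds the complete penguin cases; the seed here is an assumption, not necessarily the one used in class:

```r
library(rsample)

set.seed(432)  # assumed seed, for reproducibility

# 60% to training, stratified so species fractions are similar in each sample
our_split <- initial_split(our_tibble, prop = 0.6, strata = species)
our_train <- training(our_split)
our_test  <- testing(our_split)
```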
Training sample (n = 198):

| species | n | percent |
|---|---|---|
| Adelie | 87 | 43.9% |
| Chinstrap | 40 | 20.2% |
| Gentoo | 71 | 35.9% |
| Total | 198 | 100.0% |

Test sample (n = 135):

| species | n | percent |
|---|---|---|
| Adelie | 59 | 43.7% |
| Chinstrap | 28 | 20.7% |
| Gentoo | 48 | 35.6% |
| Total | 135 | 100.0% |
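Counts and percentages like those above can be produced with janitor's tabyl() and its adorn_*() helpers (a sketch; the exact call used in class may differ):

```r
library(janitor)

# species counts and percentages in the training sample, with a Total row
our_train |> tabyl(species) |>
  adorn_totals() |>
  adorn_pct_formatting(digits = 1)
```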
ggplot(data = our_train,
aes(x = species, y = bill_length_mm)) +
geom_violin(aes(fill = species)) +
geom_boxplot(width = 0.3, notch = TRUE) +
stat_summary(fill = "purple", fun = "mean",
geom = "point",
shape = 23, size = 3) +
facet_wrap(~ sex) +
guides(fill = "none") +
labs(title = "Bill Length, by Species, faceted by Sex",
subtitle =
glue(nrow(our_train), " of the Palmer Penguins"),
x = "Species", y = "Bill Length (in mm)")

Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
species 2 4032.7 2016.37 360.78 < 2.2e-16 ***
sex 1 770.8 770.83 137.92 < 2.2e-16 ***
Residuals 194 1084.3 5.59
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
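The ANOVA above is consistent with a two-predictor linear model; here is a sketch of the fit, with the formula inferred from the coefficient tables shown later in this deck:

```r
# m1: additive model with species and sex (formula inferred from the output shown)
m1 <- lm(bill_length_mm ~ species + sex, data = our_train)
anova(m1)
```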
| AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma |
|---|---|---|---|---|---|---|
| 908.576 | 908.888 | 925.017 | 0.816 | 0.813 | 2.340 | 2.364 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.816 | 0.813 | 2.364 | 286.492 | 0 | 3 | -449.288 | 908.576 | 925.017 | 1084.258 | 194 | 198 |
m2 <- lm(bill_length_mm ~ species, data = our_train)
## anova(m2) yields p-value < 2.2e-16 (not shown here)
tidy(m2, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, conf.low, conf.high) |>
kable(digits = 1)

| term | estimate | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | 39.1 | 38.6 | 39.7 |
| speciesChinstrap | 10.0 | 9.0 | 11.0 |
| speciesGentoo | 8.5 | 7.7 | 9.3 |
Parameter | m1 | m2
-----------------------------------------------------------------
(Intercept) | 36.99 (36.37, 37.60) | 39.12 (38.47, 39.78)
species [Chinstrap] | 10.06 ( 9.17, 10.95) | 10.00 ( 8.84, 11.17)
species [Gentoo] | 8.77 ( 8.03, 9.52) | 8.47 ( 7.50, 9.45)
sex [male] | 3.96 ( 3.29, 4.62) |
-----------------------------------------------------------------
Observations | 198 | 198
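Side-by-side coefficient tables like the one above can come from compare_parameters() in the parameters package (a hedged sketch; this assumes that is the function used here):

```r
library(parameters)

# coefficient estimates with 95% CIs for both fitted models, side by side
compare_parameters(m1, m2)
```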
bind_rows(glance(m1), glance(m2)) |>
mutate(model = c("m1", "m2")) |>
select(model, r2 = r.squared, adjr2 = adj.r.squared,
AIC, BIC, sigma, nobs) |>
kable(digits = c(0, 3, 3, 1, 1, 2, 0))

| model | r2 | adjr2 | AIC | BIC | sigma | nobs |
|---|---|---|---|---|---|---|
| m1 | 0.816 | 0.813 | 908.6 | 925.0 | 2.36 | 198 |
| m2 | 0.685 | 0.682 | 1012.9 | 1026.1 | 3.08 | 198 |
Which model has better in-sample performance?
m1 vs. m2 performance

# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2
-----------------------------------------------------------------------
m1 | lm | 908.6 (>.999) | 908.9 (>.999) | 925.0 (>.999) | 0.816
m2 | lm | 1012.9 (<.001) | 1013.1 (<.001) | 1026.1 (<.001) | 0.685
Name | R2 (adj.) | RMSE | Sigma
--------------------------------
m1 | 0.813 | 2.340 | 2.364
m2 | 0.682 | 3.061 | 3.084
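The "Comparison of Model Performance Indices" output above matches compare_performance() from the performance package (a sketch under that assumption):

```r
library(performance)

# in-sample fit summaries (AIC, BIC, R2, RMSE, sigma) for both models
compare_performance(m1, m2)
```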
Which model has better in-sample performance?
m1 vs. m2 (training)

m1_aug <- augment(m1, newdata = our_test)
m1_res <- m1_aug |>
summarize(val_R_sq = cor(bill_length_mm, .fitted)^2,
MAPE = mean(abs(.resid)),
RMSPE = sqrt(mean(.resid^2)),
max_Error = max(abs(.resid)))
m2_aug <- augment(m2, newdata = our_test)
m2_res <- m2_aug |>
summarize(val_R_sq = cor(bill_length_mm, .fitted)^2,
MAPE = mean(abs(.resid)),
RMSPE = sqrt(mean(.resid^2)),
max_Error = max(abs(.resid)))

bind_rows(m1_res, m2_res) |>
mutate(model = c("m1", "m2")) |>
relocate(model) |> kable(digits = c(0, 3, 2, 2, 1))

| model | val_R_sq | MAPE | RMSPE | max_Error |
|---|---|---|---|---|
| m1 | 0.831 | 1.77 | 2.29 | 6.3 |
| m2 | 0.742 | 2.31 | 2.83 | 8.3 |
Which model predicts better in the test sample?
m1 (see next 3 slides)

check_model(m1): first 2 plots (figures not shown)
check_model(m1): next 2 plots (figures not shown)
check_model(m1): final 2 plots (figures not shown)

Fit model (m1) in the training sample; evaluate the quality of fit.
Fit model (m2) in the training sample; evaluate the quality of fit.
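The diagnostic plots referenced above come from check_model() in the performance package; a sketch of the call:

```r
library(performance)

# residual diagnostic panels for model m1
# (posterior predictive check, linearity, homogeneity of variance,
#  influential observations, collinearity, normality of residuals)
check_model(m1)
```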